Parallel Training Data Selection for Conversational Machine Translation
Abstract
We describe data selection strategies for an English-French Skype conversation translation task for which no in-domain training data is available. We evaluate selection methods based on language modeling and on a text formality criterion. Our main finding is that translating conversation transcripts was less challenging than we expected: although translation quality is not perfect, a straightforward phrase-based system trained on movie subtitles yields high BLEU scores, and a simple heuristic for selecting more Skype-like examples gives small further improvements.
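The abstract does not say which language-modeling criterion was used. As an illustration only, the sketch below ranks candidate sentences by cross-entropy difference, a common language-model-based selection score, using add-one smoothed unigram models. The function names, the unigram models, and the use of a small in-domain-like seed set are all assumptions made for this example, not the authors' setup. For parallel data, one would typically score one side of each sentence pair and keep the pair.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Return (token counts, total token count) for a list of sentences."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    return counts, sum(counts.values())


def cross_entropy(sentence, model, vocab_size):
    """Per-token cross-entropy in bits under an add-one smoothed unigram model.

    Assumes the sentence contains at least one token.
    """
    counts, total = model
    tokens = sentence.lower().split()
    log_prob = sum(
        math.log2((counts[t] + 1) / (total + vocab_size)) for t in tokens
    )
    return -log_prob / len(tokens)


def select(pool, seed, general, keep):
    """Keep the `keep` pool sentences with the lowest cross-entropy difference.

    seed:    hypothetical small sample of target-like (conversational) text
    general: hypothetical sample of the general, out-of-domain text
    """
    in_lm = train_unigram(seed)
    gen_lm = train_unigram(general)
    # Shared vocabulary size, plus one slot reserved for unseen tokens.
    vocab_size = len(set(in_lm[0]) | set(gen_lm[0])) + 1
    candidates = [s for s in pool if s.strip()]  # skip empty lines
    ranked = sorted(
        candidates,
        key=lambda s: cross_entropy(s, in_lm, vocab_size)
        - cross_entropy(s, gen_lm, vocab_size),
    )
    return ranked[:keep]
```

A lower score means the sentence looks more like the seed text than like the general text. In practice the unigram models would be replaced by stronger n-gram or neural language models.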
Similar resources
Applying Cross-Entropy Difference for Selecting Parallel Training Data from Publicly Available Sources for Conversational Machine Translation
Cross Entropy Difference (CED) has proven to be a very effective method for selecting domain-specific data from large corpora of out-of-domain or general domain content. It is used in a number of different scenarios, and is particularly popular in bake-off competitions in which participants have a limited set of resources to draw from, and need to sub-sample the data in such a way as to ensure ...
Improving Statistical Machine Translation Performance by Training Data Selection and Optimization
A parallel corpus is an indispensable resource for translation model training in statistical machine translation (SMT). Instead of collecting more and more parallel training corpora, this paper aims to improve SMT performance by exploiting the full potential of the existing parallel corpora. Two kinds of methods are proposed: offline data optimization and online model optimization. The offline method...
Extracting Parallel Corpora from Comparable Documents to Improve Translation Quality in Machine Translation Systems
Data for training statistical machine translation systems are usually prepared from three kinds of resources: parallel, non-parallel, and comparable text corpora. Parallel corpora are the ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are also used for parallel text extraction. Most of existing methods for exploiting comparable corpora loo...
Train the Machine with What It Can Learn - Corpus Selection for SMT
Statistical machine translation relies heavily on available parallel corpora, but SMT may not have the ability or intelligence to make full use of the training set. Instead of collecting more and more parallel training corpora, this paper aims to improve SMT performance by exploiting the full potential of existing parallel corpora. We first identify literally translated sentence pairs via lexic...
PhD Defense Presentation 2219 Engineering Building "Data Analysis and Selection for Statistical Machine Translation"
Statistical machine translation has received significant attention from the academic community over the past decade. This research has led to significant improvements in machine translation quality. As a result, it is widely adopted in industry (Google, Microsoft, Twitter, Facebook, etc.) as well as in government (http://nist.gov). The biggest factor in this improvement has been the av...
Publication date: 2016